Artificial neural networks ensemble methodology to predict significant wave height

Minuzzi, Felipe Crivellaro, Farina, Leandro

arXiv.org Artificial Intelligence

Institute of Mathematics and Statistics, Federal University of Rio Grande do Sul (UFRGS); Center for Coastal and Oceanic Geology Studies (CECO), Federal University of Rio Grande do Sul (UFRGS)

Abstract: The forecast of wave variables is important for several applications that depend on a better description of the ocean state. Due to the chaotic behaviour of the differential equations that model this problem, a well-known strategy to overcome the difficulties is to run several simulations, for instance by varying the initial condition, and to average their results, creating an ensemble. Moreover, in the last few years, given the amount of available data and the increase in computational power, machine learning algorithms have been applied as surrogates for traditional numerical models, yielding comparable or better results. In this work, we present a methodology to create an ensemble of different artificial neural network architectures, namely MLP, RNN, LSTM, CNN and a hybrid CNN-LSTM, which aims to predict significant wave height at six different locations on the Brazilian coast. The networks are trained using NOAA's numerical reforecast data and target the residual between observational data and the numerical model output. A new strategy to create the training and target datasets is demonstrated.

Introduction: Numerical simulations of both weather and ocean parameters rely on the evolution of nonlinear dynamical systems that have a high sensitivity to initial conditions. Considering that errors are present in the observations and analysis, and therefore in the initial conditions, the concept of a unique deterministic solution of the governing equations becomes fragile [1, 2].
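The residual-targeting idea in this abstract can be illustrated with a minimal sketch. It assumes the residual is defined as observation minus numerical model output and that the ensemble members are combined by a plain average; both the sign convention and the averaging rule are assumptions made here for illustration, as are the function and parameter names. The paper's actual combination scheme may differ.

```python
from statistics import mean


def ensemble_forecast(numerical_hs, residual_predictions):
    """Correct a numerical wave-height forecast with averaged residuals.

    numerical_hs: significant wave height from the numerical model,
        one value per time step.
    residual_predictions: one list per ensemble member (hypothetically
        an MLP, an LSTM, and so on), each holding that member's predicted
        residual (assumed: observation minus model output) per time step.
    """
    if not residual_predictions:
        raise ValueError("at least one ensemble member is required")
    for member in residual_predictions:
        if len(member) != len(numerical_hs):
            raise ValueError("every member must cover every time step")

    corrected = []
    for t, hs in enumerate(numerical_hs):
        # Average the members' residuals, then add the result back
        # onto the numerical model output for that time step.
        avg_residual = mean(member[t] for member in residual_predictions)
        corrected.append(hs + avg_residual)
    return corrected
```

With two members predicting residuals of 0.5 and 1.5 for a model output of 2.0, the corrected height at that step would be 3.0.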


Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

Wong, Brian Shing-Hei, Kim, Joshua Mincheol, Fung, Sin-Hang, Xiong, Qing, Ao, Kelvin Fu-Kiu, Wei, Junkang, Wang, Ran, Wang, Dan Michelle, Zhou, Jingying, Feng, Bo, Cheng, Alfred Sze-Lok, Yip, Kevin Y., Tsui, Stephen Kwok-Wing, Cao, Qin

arXiv.org Artificial Intelligence

Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm's performance by detecting important differences among protein sequences. In addition to providing Applm as open-source software, we also provide our carefully curated benchmark datasets to facilitate future research.


Expected Possession Value of Control and Duel Actions for Soccer Player's Skills Estimation

Shelopugin, Andrei

arXiv.org Artificial Intelligence

Estimation of football players' skills is one of the key tasks in sports analytics. This paper introduces multiple extensions to a widely used model, expected possession value (EPV), to address key challenges such as the selection problem. First, we assign greater weights to events occurring immediately prior to the shot than to those preceding them (decay effect). Second, our model incorporates possession risk more accurately by considering the decay effect and effective playing time. Third, we integrate the assessment of individual player ability to win aerial and ground duels. Using the extended EPV model, we predict this metric for various football players for the upcoming season, particularly taking into account the strength of their opponents.
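The decay effect mentioned in this abstract could be sketched as follows. The abstract does not state the functional form of the weights, so the exponential decay, the `decay` parameter, and both function names are assumptions chosen for illustration only.

```python
def decay_weights(n_events, decay=0.8):
    """Return normalized weights for the events leading up to a shot.

    Events are ordered oldest first; the last event (closest to the shot)
    gets the largest weight. Exponential decay is an assumed form.
    """
    if n_events <= 0:
        return []
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    # The event i steps before the shot gets raw weight decay**i.
    raw = [decay ** (n_events - 1 - i) for i in range(n_events)]
    total = sum(raw)
    return [w / total for w in raw]


def credit_shot_value(shot_value, n_events, decay=0.8):
    """Split a shot's value across the preceding events using the weights."""
    return [shot_value * w for w in decay_weights(n_events, decay)]
```

Under these assumptions, `decay_weights(3, 0.5)` gives raw weights 0.25, 0.5 and 1.0 before normalization, so the final event receives four times the credit of the first.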


Biclustering a dataset using photonic quantum computing

Borle, Ajinkya, Bhave, Ameya

arXiv.org Artificial Intelligence

Biclustering is a problem in machine learning and data mining that seeks to group together rows and columns of a dataset according to certain criteria. In this work, we highlight the natural relation that quantum computing models like boson and Gaussian boson sampling (GBS) have to this problem. We first explore the use of boson sampling to identify biclusters based on matrix permanents. We then propose a heuristic that finds clusters in a dataset using Gaussian boson sampling by (i) converting the dataset into a bipartite graph and then (ii) running GBS to find the densest sub-graph(s) within the larger bipartite graph. Our simulations for the above proposed heuristics show promising results for future exploration in this area.
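For readers unfamiliar with the quantity involved, the matrix permanent referred to above can be computed classically for small matrices. The sketch below is a brute-force illustration, not the authors' method and not a simulation of a photonic device; the helper names and the toy dataset are hypothetical. It shows how a submatrix covering a block of large entries scores a larger permanent than one that does not.

```python
from itertools import permutations
from math import prod


def permanent(matrix):
    """Brute-force permanent of a square matrix (feasible only for small n)."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    # Same sum over permutations as the determinant, but without signs.
    return sum(
        prod(matrix[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


def submatrix(matrix, rows, cols):
    """Select the given rows and columns, a candidate bicluster."""
    return [[matrix[r][c] for c in cols] for r in rows]


# Toy dataset (hypothetical): rows 0-1 and columns 0-1 form a dense block.
data = [
    [5, 5, 0],
    [5, 5, 0],
    [0, 0, 1],
]
dense_score = permanent(submatrix(data, (0, 1), (0, 1)))
sparse_score = permanent(submatrix(data, (1, 2), (1, 2)))
```

The brute-force sum has factorial cost, which is one reason sampling-based approaches are of interest for this quantity.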


Comparing Computational Architectures for Automated Journalism

Sym, Yan V., Campos, João Gabriel M., José, Marcos M., Cozman, Fabio G.

arXiv.org Artificial Intelligence

The majority of NLG systems have been designed following either a template-based or a pipeline-based architecture. Recent neural models for data-to-text generation have been proposed with an end-to-end deep learning flavor, which handles non-linguistic input in natural language without explicit intermediary representations. This study compares the most often employed methods for generating Brazilian Portuguese texts from structured data. Results suggest that explicit intermediate steps in the generation process produce better texts than the ones generated by neural end-to-end architectures, avoiding data hallucination while better generalizing to unseen inputs. Code and corpus are publicly available.


Performance Evaluation of DCA and SRC on a Single Bot Detection

Al-Hammadi, Yousof, Aickelin, Uwe, Greensmith, Julie

arXiv.org Artificial Intelligence

Malicious users try to compromise systems using new techniques. One of the recent techniques used by the attacker is to perform complex distributed attacks such as denial of service and to obtain sensitive data such as password information. These compromised machines are said to be infected with malicious software termed a "bot". In this paper, we investigate the correlation of behavioural attributes such as keylogging and packet flooding behaviour to detect the existence of a single bot on a compromised machine by applying (1) Spearman's rank correlation (SRC) algorithm and (2) the Dendritic Cell Algorithm (DCA). We also compare the output results generated by these two methods for the detection of a single bot. The results show that the DCA performs better in detecting malicious activities.
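As an illustration of the first technique named above, Spearman's rank correlation can be computed as the Pearson correlation of rank-transformed data. The sketch below is a generic textbook implementation, not the authors' detection pipeline; the idea of feeding it per-window keystroke and packet counts is a hypothetical usage.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)
```

A value near 1 for two behavioural series (for example, hypothetical counts of keystroke-logging calls and outgoing packets per time window) would indicate that they rise and fall together.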


Detecting Motifs in System Call Sequences

Wilson, William O., Feyereisl, Jan, Aickelin, Uwe

arXiv.org Artificial Intelligence

The search for patterns or motifs in data represents an area of key interest to many researchers. In this paper we present the Motif Tracking Algorithm, a novel immune inspired pattern identification tool that is able to identify unknown motifs which repeat within time series data. The power of the algorithm is derived from its use of a small number of parameters with minimal assumptions. The algorithm searches from a completely neutral perspective that is independent of the data being analysed, and the underlying motifs. In this paper the motif tracking algorithm is applied to the search for patterns within sequences of low level system calls between the Linux kernel and the operating system's user space. The MTA is able to compress data found in large system call data sets to a limited number of motifs which summarise that data. The motifs provide a resource from which a profile of executed processes can be built. The potential for these profiles and new implications for security research are highlighted. A higher level call system language for measuring similarity between patterns of such calls is also suggested.


DCA for Bot Detection

Al-Hammadi, Yousof, Aickelin, Uwe, Greensmith, Julie

arXiv.org Artificial Intelligence

Abstract: Ensuring the security of computers is a nontrivial task, with many techniques used by malicious users to compromise these systems. In recent years a new threat has emerged in the form of networks of hijacked zombie machines used to perform complex distributed attacks such as denial of service and to obtain sensitive data such as password information. These zombie machines are said to be infected with a 'bot', a malicious piece of software which is installed on a host machine and is controlled by a remote attacker, termed the 'botmaster of a botnet'. In this work, we use the biologically inspired Dendritic Cell Algorithm (DCA) to detect the existence of a single bot on a compromised host machine. The DCA is an immune-inspired algorithm based on an abstract model of the behaviour of the dendritic cells of the human body. The anomaly detection performed by the DCA is based on the correlation of behavioural attributes such as keylogging and packet flooding behaviour. The results of applying the DCA to the detection of a single bot show that the algorithm is a successful technique for detecting such malicious software without responding to normally running programs.

Computer systems and networks come under frequent attack from a diverse set of malicious programs and activity. Computer viruses posed a large problem in the late 1980s, and computer worms were problematic from the 1990s through to the early 21st century. While the detection of such worms and viruses is improving, a new threat has emerged in the form of the botnet. Botnets are decentralised, distributed networks of subverted machines, controlled by a central commander, affectionately termed the 'botmaster'. A single bot is a malicious piece of software which, when installed on an unsuspecting host, transforms the host into a zombie machine.